Overview and Motivation

Sabermetrics has become quite popular, especially in recent years. Researchers have developed metrics that are widely used to gauge player performance, but the search for more reliable and accurate measures is ongoing. In particular, teams are interested in buying wins. We begin our search by analyzing an increasingly popular but not officially recognized metric: wins above replacement (WAR), a single metric that attempts to directly quantify how many wins a player is worth to a team. Because WAR is not officially recognized by Major League Baseball, there is no universal way of calculating it; the three most popular versions come from Baseball Reference, Baseball Prospectus, and FanGraphs. But how well do these three versions predict a team's true wins from its team total WAR? We address this question and provide some further analysis. In doing so, we also aim to derive a more “comprehensive” metric, one that combines the three to improve results.

Initial Questions

Some initial questions we wish to answer include:

- What is the relationship between WAR and the number of wins a team sees?
- How accurately does WAR predict the number of wins a team observes in a given year?
- Are there any systematic issues with WAR in predicting true wins?
- How can the three different ways to calculate WAR be combined to produce a more reliable statistic?
- What are some efficient techniques for scraping data from FanGraphs, Baseball Prospectus, and Baseball Reference, all of which are commonly visited websites in the baseball world?

Data

WAR, again a non-standard metric, is based on a number of factors including batting runs, baserunning runs, runs added or lost due to grounding into double plays, fielding runs, positional adjustment, replacement level, and even stadium adjustments. Each of our sources weighs these components in a slightly different way.

Our data were obtained from the Teams dataset in the Lahman package (Lahman contains no WAR statistic) and from three external sources: Baseball Prospectus, Baseball Reference, and FanGraphs. Because of the different formats of the three sources and the issues encountered with each (detailed below), scraping the data proved a large effort, and the results still required extensive cleaning to arrive at a complete and usable dataset.

Baseball Prospectus

Baseball Prospectus was the least challenging of the three for data wrangling. All players for each season from 1871 to 2017 were displayed on a single page with their respective team and BWARP (Baseball Prospectus's version of WAR), as can be seen in the image below. To scrape year-specific data, we vectorized over the range of years, used the rvest functions set_values and submit_form to automate the form selection, and extracted the table of interest by identifying the corresponding html_node. The code is provided below (though we recommend caution in running any of our data-wrangling code blocks, as they can take on the order of hours to run; we wrote all of our wrangled data to csv files for much faster access in the later analyses).

library(rvest)

url_batting <- "http://legacy.baseballprospectus.com/sortable/index.php?cid=2022181"
url_pitching <- "http://legacy.baseballprospectus.com/sortable/index.php?cid=1931167"

prospectus <- function(year, url){
  year <- as.character(year)
  pgsession <- html_session(url)
  pgform <- html_form(pgsession)[[3]]
  
  #Fill in the year-selection form
  filled_form <- set_values(pgform, "year" = year)
  
  #Submit selection form and extract the fifth table on the page
  d <- submit_form(session = pgsession, form = filled_form)
  dat <- d %>%
    html_nodes("table") %>%
    .[[5]] %>%
    html_table(header = TRUE)
  dat
}

#Submit Year = X for X in [1871, 2017] to obtain year-specific data
prospectus_batting <- do.call(rbind, lapply(1871:2017, prospectus, url = url_batting))
prospectus_pitching <- do.call(rbind, lapply(1871:2017, prospectus, url = url_pitching))

#Read back the saved scrapes and reformat
prospectus_pitching <- read.csv("prospectus_pitching_full.csv")
prospectus_batting <- read.csv("prospectus_batting_full.csv")

prospectus_pitching_small <- prospectus_pitching[, c("NAME", "TEAM", "YEAR", "PWARP")]
names(prospectus_pitching_small) <- c("Name", "Team", "Year", "WARP")
prospectus_batting_small <- prospectus_batting[, c("NAME", "TEAM", "YEAR", "BWARP")]
names(prospectus_batting_small) <- c("Name", "Team", "Year", "WARP")

Baseball Reference

The data on Baseball Reference are grouped by team and year. Separate tables are displayed for batters and pitchers. The screenshot below provides an example of one of the tables shown for the New York Yankees in 2017.

Links to these tables follow a general pattern, so a vector of all the possible links was created. We then looped through all of these links to grab the tables from each page. Final touches were made to get the datasets into the right format (again, the computation time was very large).

teams <- unique(Teams$franchID)
years <- 1871:2017

urls <- matrix(0, length(teams), length(years))
for(i in 1:length(teams)) {
  for(j in 1:length(years)) {
    urls[i, j] <- paste0("https://www.baseball-reference.com/teams/", teams[i], "/", years[j], ".shtml")
  }
}
url_vector <- as.vector(urls)

library(XML)  #htmlParse, readHTMLTable

list_of_batting <- list()
list_of_pitching <- list()
for(i in 1:length(url_vector)) {
  url <- url_vector[i]
  
  res <- try(readLines(url), silent = TRUE)
  
  ## check if website exists
  if(inherits(res, "try-error")) {
    list_of_batting[[i]] <- NA
    list_of_pitching[[i]] <- NA
  }
  else {
    ## the WAR tables are hidden inside HTML comments, so strip the comment markers
    urltxt <- gsub("-->", "", gsub("<!--", "", res))
    doc <- htmlParse(urltxt)
    tables_full <- readHTMLTable(doc)
    list_of_batting[[i]] <- tables_full$players_value_batting
    list_of_pitching[[i]] <- tables_full$players_value_pitching
  }
  print(i)
  closeAllConnections()
}

## find indices where the link exists
ind_batting <- which(!is.na(list_of_batting))
ind_pitching <- which(!is.na(list_of_pitching))

## find links that exist
url_batting <- url_vector[ind_batting]
url_pitching <- url_vector[ind_pitching]

## extract year from each url
years_batting <- as.numeric(unlist(regmatches(url_batting, gregexpr("[[:digit:]]+", url_batting))))
years_pitching <- as.numeric(unlist(regmatches(url_pitching, gregexpr("[[:digit:]]+", url_pitching))))

## extract team from each url
teams_batting <- basename(dirname(url_batting))
teams_pitching <- basename(dirname(url_pitching))

## remove NAs from lists
na.omit.list <- function(y) { 
  return(y[!sapply(y, function(x) all(is.na(x)))]) 
}

test_batting <- na.omit.list(list_of_batting)
test_pitching <- na.omit.list(list_of_pitching)

## add columns for year and team
test_batting <- mapply(cbind, test_batting, "Year" = years_batting, 
                       "Team" = teams_batting, SIMPLIFY = F)
test_pitching <- mapply(cbind, test_pitching, "Year" = years_pitching,
                        "Team" = teams_pitching, SIMPLIFY = F)

bbref_batting <- bind_rows(test_batting)
bbref_pitching <- bind_rows(test_pitching)

FanGraphs

On FanGraphs, data are available for each player on each of the current 30 MLB teams, as far back as their respective foundings. The URLs for these pages follow a standard format, though, as with Baseball Prospectus, there is a separate page per team for offensive players (i.e., hitters) and for pitchers. A challenge unique to the FanGraphs scraping was that the pages are generated via JavaScript, so simply fetching the HTML from R would return only the skeleton of the page, not the required data. To remedy this, a short PhantomJS script was written to execute the JavaScript on the FanGraphs site. From within R, we edit this script (specifically the var url line) to cover all players (both pitchers and hitters) and all teams. Then we simply parse the HTML, extract the WAR values for each player, sum them by team and year, reformat the data, and save the result (team, year, total WAR) to a csv file which we can load for our analysis.

var url ='https://www.fangraphs.com/leaders.aspx?pos=all&stats=pit&lg=all&qual=0&type=8&season=2017&month=0&season1=2017&ind=0&team=30&rost=0&age=0';
var page = new WebPage()
var fs = require('fs');


page.open(url, function (status) {
        just_wait();
});

function just_wait() {
    setTimeout(function() {
               fs.write('1.html', page.content, 'w');
            phantom.exit();
    }, 2500);
}
get_WAR_from_team_year=function(team, year)
{   
    
    #Batters first
    prep_jc_url(team, year, 1)
    system("phantomjs scrap_final.js")
    myHtml <- readLines("1.html")
    myHtml_sub = myHtml[ grep('LeaderBoard1_dg1_ctl00', myHtml)[1]: grep('LeaderBoard1_dg1_SharedCalendarContainer', myHtml)]
    myHtml_sub = myHtml_sub [ grep('playerid=', myHtml_sub) ]
    tmp_o=lapply(myHtml_sub, get_WAR_from_PIDrow, batting=1)
    
    #then pitchers
    prep_jc_url(team, year, 0)
    system("phantomjs scrap_final.js")
    myHtml <- readLines("1.html")
    myHtml_sub = myHtml[ grep('LeaderBoard1_dg1_ctl00', myHtml)[1]: grep('LeaderBoard1_dg1_SharedCalendarContainer', myHtml)]
    myHtml_sub = myHtml_sub [ grep('playerid=', myHtml_sub) ]
    tmp_p=lapply(myHtml_sub, get_WAR_from_PIDrow, batting=0)
    
    if(length(tmp_p) > 0)
    {
    
        tmp_o_mat = t(matrix(unlist(tmp_o), nrow  =2))
        tmp_p_mat = t(matrix(unlist(tmp_p), nrow  =2))
        
        
        #could return list, but why not just sum now and return sum? 
        df_war = data.frame(rbind(tmp_o_mat, tmp_p_mat))
        #return(df_war)
        ret = sum(as.numeric(as.character(df_war[,1])))
        return(ret)
    }
    else{return(0)}
}


prep_jc_url=function (team, year, batting)
{
     url = paste0('https://www.fangraphs.com/leaders.aspx?pos=all&stats=', ifelse(batting==1, "bat", "pit"),'&lg=all&qual=0&type=8&season=', year, '&month=0&season1=', year, '&ind=0&team=', team, '&rost=0&age=0')
     lines <- readLines("scrap_final.js")
     lines[1] <- paste0("var url ='", url ,"';")
     writeLines(lines, "scrap_final.js")
}



get_WAR_from_PIDrow = function(string, batting)
{
    #WAR
    tmp = gregexpr('<td class="grid_line_regular" align="right">', string)
    start = tmp[[1]][(length(tmp[[1]]))]
    tmp_string = substr(string, start=start, stop = start+100)
    WAR = as.numeric(split_string_on_gele(tmp_string)[ifelse(batting==1, 3, 7)])
    
    #Name
    tmp = gregexpr('playerid', string)
    start = tmp[[1]][(length(tmp[[1]]))]
    tmp_string = substr(string, start=start, stop = start+100)
    name = split_string_on_gele(tmp_string)[2]
    
    return(c(WAR, name))
}

split_string_on_gele = function(string)
{
    strsplit(string, "[><]")[[1]]
}



FanGraphWar = matrix(nrow=30, ncol= 115)


start.time <- Sys.time()
for (team in 1:30)
{
    for(year in 1:115)
    {
        reportedTeamWAR = get_WAR_from_team_year(team, year+1902)
        FanGraphWar[team, year] = reportedTeamWAR
        print(paste(reportedTeamWAR, team, year))
    }
    
}
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

FGW = as.data.frame(t(FanGraphWar))

#label with franchIDs (columns V1..V30 correspond to these franchises, in order)
names(FGW) <- c("ANA", "BAL", "BOS", "CHW", "CLE", "DET", "KCR", "MIN", "NYY", "OAK",
                "SEA", "TBD", "TEX", "TOR", "ARI", "ATL", "CHC", "CIN", "COL", "FLA",
                "HOU", "LAD", "MIL", "WSN", "NYM", "PHI", "PIT", "STL", "SDP", "SFG")

#make sure we only have the 30 current teams
library(Lahman)
keeps = tail(Teams$franchID, n = 30)
FGW = FGW[, (names(FGW) %in% keeps)]

#transform it to a better format

FanGraphWARs <- data.frame(franchID=character(), year=integer(), WAR=double(), stringsAsFactors=FALSE) 


for (team in colnames(FGW))
{
    zero_marker = 1
    for(year_idx in 1:115)
    {
        tmp = as.data.frame(x=list(franchID = team, year = year_idx + 1902, WAR = FGW[year_idx, team]))
        
        if (zero_marker == 1)
        {
            if (tmp$WAR == 0.0)
            {
                #do nothing, before team was created 
            }
            else
            {
                zero_marker = 0 #we know the franchise has been created now
                FanGraphWARs = rbind(FanGraphWARs , tmp)
            }
        }
        else
        {
            FanGraphWARs = rbind(FanGraphWARs , tmp)
        }
    }
}

write.csv(FanGraphWARs, "FanGraphWARs_better_format.csv")

Combine data sets

After getting our three datasets into csv format, the next step was to combine them into one usable data frame for our analysis. Notably, Baseball Prospectus's data only go back to 1949, so we opted to include data from 1949 onward from all three sources in our analyses. An additional point of note for those unfamiliar with the WAR statistic: it takes into account how many wins a team of replacement players would get over the course of a season, and a player's WAR is meant to estimate how many additional wins that player provides in place of a replacement player. In other words, our computed team total WARs are not in fact the predicted number of wins, but the predicted number of wins over a replacement team. For Baseball Reference and FanGraphs, a team of replacement players is expected to win 0.294 of its games, while Baseball Prospectus sets this line at 0.320. To get predicted wins, we multiply the number of games a team played in a season (acquired from Lahman) by these replacement win percentages, then add the team total WAR.
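As a quick sanity check of this conversion (the team here is hypothetical, not drawn from our data): a team playing 162 games under the 0.294 replacement line, with a team total WAR of 40, would be predicted to win about 87.6 games.

```r
#Hypothetical example of the replacement-level conversion (values invented)
games    <- 162    #games played in the season
team_war <- 40     #team total WAR
repl_pct <- 0.294  #Baseball Reference / FanGraphs replacement winning percentage

predicted_wins <- games * repl_pct + team_war
predicted_wins  #87.628
```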

#Read in BBProspectus data and remove extraneous columns
BBProspectus <- read.csv("prospectus.csv")
BBProspectus <- BBProspectus %>%
    rename(BBproWAR = WARP, teamID = franchID) %>%
    select(c(-X, -W))

#Baseball Prospectus had some funny names, so first we need to align them with our other datasets using Lahman
BBProspectus$franchID = NA

BBProspectus = BBProspectus %>%
    filter(yearID >= 1949)

for (i in 1:nrow(BBProspectus))
{
    
    BBProspectus[i, ]$franchID = 
        as.character(unique(
            Teams$franchID[as.character(Teams$teamID) == as.character(BBProspectus[i, ]$teamID)])[1])
}    

BBProspectus = BBProspectus %>%
    select(c(-teamID))

#Read in BBReference data and remove extraneous columns
BBReference <- read.csv("BBReference.csv")
BBReference <- BBReference %>%
    filter(yearID >= 1949) %>%
    rename(BRefWAR = WAR) %>%
    select(c(-X, -source, -W))

BBReference = BBReference[!is.na(BBReference$BRefWAR), ]
#Baseball Reference Team IDs are based on the team of that year, need to correct this for a few teams
alts = c("LAA", "CAL", "TBR", "MIA", "PHA")
delt = c("ANA", "ANA", "TBD", "FLA", "OAK")

for (i in 1:nrow(BBReference))
{
    if ((BBReference[i, ]$franchID %in% alts))
    {
        BBReference[i, ]$franchID =  delt[match(BBReference[i, ]$franchID, alts)]
    }
}

FanGraphs <- read.csv("Conor/FanGraphWARs_better_format.csv")
FanGraphs <- FanGraphs %>%
    rename(FanGraphWAR = WAR, yearID=year) %>%
    select(c(-X))

#Inner join the three for a complete data set
data <- inner_join(BBReference, BBProspectus, by = c("franchID", "yearID"))
data = inner_join(data, FanGraphs, by =c("franchID", "yearID"))

#Join combined data with true number of wins in Lahman's dataset to create wide data set 
team_sub <- Teams %>%
    select(yearID, franchID, W, G, lgID)  %>%
    rename(Truth = W, Games = G)
data_wide <- left_join(data, team_sub, by = c("franchID", "yearID"))

data_wide = data_wide %>%
    filter(yearID %in% 1949:2016) %>%
    mutate(BRefWAR = BRefWAR + Games * 0.294, FanGraphWAR = FanGraphWAR + Games * 0.294, BBproWAR = BBproWAR + Games*0.320) %>%
    rename(BRefPred = BRefWAR, FanGraphPred = FanGraphWAR, BBproPred = BBproWAR)

data_long <- gather(data_wide, source, WAR, -c(franchID, yearID))

Exploratory Analysis


How well does WAR compare to the true number of wins?

First, we wanted to assess at the most basic level which predictor performed best. We defined our assessment in terms of the mean squared error (MSE) between the true wins a team had during a season and the wins predicted by each of the three sources. In terms of MSE, Baseball Reference performed best (MSE = 5.13), followed by FanGraphs (MSE = 6.23), with Baseball Prospectus last (MSE = 9.25). Additionally, we wanted to assess how well the different WAR statistics correlated with true wins. Ideally, we would expect slopes of 1 (i.e., each additional WAR corresponds to one additional team win). By this measure, FanGraphs had the best model, with a slope of 1.002, followed closely by Baseball Reference at 0.986, and lastly, far behind the other two, Baseball Prospectus at 0.692.

Next, looking at the mean differences, we see that Baseball Reference is close to 0 (as we would expect). However, both Baseball Prospectus and FanGraphs have means above 2, an indication of possible systematic error. Looking at the standard deviations, we continue to see the trend of Baseball Prospectus being the worst predictor of the three, while Baseball Reference and FanGraphs have similar standard deviations, with Baseball Reference slightly better.

In 2013, Baseball Reference published an article about updating its WAR baseline wins; previously, it had used a different baseline for the American and National Leagues (the AL was already at 0.294, but the NL was changed from 0.320 to 0.294). The article also notes that part of the reason for the change was that FanGraphs was already using 0.294 for both leagues. Based on this, we were interested in investigating whether the MSE differed between American and National League teams.
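The MSE() function called below is loaded from a package (we assume a helper such as MLmetrics::MSE). A minimal hand-rolled equivalent, shown only to make the metric explicit:

```r
#Mean squared error: the average squared difference between truth and prediction.
#Equivalent to a package helper such as MLmetrics::MSE (an assumption on our part).
MSE <- function(truth, pred) {
  mean((truth - pred)^2)
}

MSE(c(90, 85, 70), c(92, 84, 75))  #(4 + 1 + 25) / 3 = 10
```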

statistics_matrix = matrix(nrow = 3, ncol = 4)
rownames(statistics_matrix) = c("Fan Graph", "Baseball Prospectus", "Baseball Reference")
colnames(statistics_matrix) = c("MSE", "Mean Difference", "Standard Deviation", "Slope")

FG_diff = data_wide$FanGraphPred - data_wide$Truth
statistics_matrix[1,1] = MSE(data_wide$Truth, data_wide$FanGraphPred)
statistics_matrix[1,2] = mean(FG_diff)
statistics_matrix[1,3] = sd(FG_diff)
statistics_matrix[1,4] = lm(Truth ~ FanGraphPred, data = data_wide)$coef[[2]]

BBP_diff = data_wide$BBproPred - data_wide$Truth
statistics_matrix[2,1] = MSE(data_wide$Truth, data_wide$BBproPred)
statistics_matrix[2,2] = mean(BBP_diff)
statistics_matrix[2,3] = sd(BBP_diff)
statistics_matrix[2,4] = lm(Truth ~ BBproPred, data = data_wide)$coef[[2]]

BRef_diff = data_wide$BRefPred - data_wide$Truth
statistics_matrix[3,1] = MSE(data_wide$Truth, data_wide$BRefPred)
statistics_matrix[3,2] = mean(BRef_diff)
statistics_matrix[3,3] = sd(BRef_diff)
statistics_matrix[3,4] = lm(Truth ~ BRefPred, data = data_wide)$coef[[2]]

show(statistics_matrix)
##                          MSE Mean Difference Standard Deviation     Slope
## Fan Graph           6.239231       2.5455374           5.698188 1.0017252
## Baseball Prospectus 9.256546       2.8262118           8.817408 0.6918975
## Baseball Reference  5.136446       0.2340364           5.132780 0.9863421
league_splits = data_wide %>%
    group_by(lgID) %>%
    summarize(MSE(Truth, FanGraphPred), MSE(Truth, BBproPred),
              MSE(Truth, BRefPred) )


diff2 = cbind(FG_diff, BBP_diff, BRef_diff)
diff2 = as.data.frame(diff2)

diff2 %>% ggplot() + 
    geom_histogram(aes(x=FG_diff, fill="Fan Graph"), alpha = 0.2, breaks = seq(-20, 20, 0.5)) +
    geom_histogram(aes(x=BBP_diff, fill="Baseball Prospectus"), alpha = 0.2, breaks = seq(-20, 20, 0.5)) +
    geom_histogram(aes(x=BRef_diff, fill="Baseball Reference"), alpha = 0.2, breaks = seq(-20, 20, 0.5)) +
    ggtitle("Difference Between Predicted and True Wins") +
    xlab("Difference") +
    ylab("Count") +
    scale_fill_manual(values = c("Fan Graph" = rgb(1,0,0), "Baseball Prospectus" = rgb(0,1,0), "Baseball Reference"=rgb(0,0,1)))

data_wide %>% ggplot(aes(x = Truth, y = FanGraphPred)) +  
    geom_point(color="red", size = 1) +
    geom_abline(slope = 1, intercept = 0) +
    ggtitle("Fan Graph True v Predicted Wins") +
    xlab("True Wins") +
    ylab("Predicted Wins") +
    xlim(40, 125) +
    ylim(40, 125)

data_wide %>% ggplot(aes(x = Truth, y = BBproPred)) +  
    geom_point(color="green", size = 1) +
    geom_abline(slope = 1, intercept = 0) +
    ggtitle("Baseball Prospectus True v Predicted Wins") +
    xlab("True Wins") +
    ylab("Predicted Wins") +
    xlim(40, 125) +
    ylim(40, 125)

data_wide %>% ggplot(aes(x = Truth, y = BRefPred)) +  
    geom_point(color="blue", size = 1) +
    geom_abline(slope = 1, intercept = 0) +
    ggtitle("Baseball Reference True v Predicted Wins") +
    xlab("True Wins") +
    ylab("Predicted Wins") +
    xlim(40, 125) +
    ylim(40, 125)

data_wide %>% ggplot() +
    geom_point(aes(x = Truth, y = FanGraphPred, color="Fan Graph"), alpha = 0.5) + 
    geom_point(aes(x = Truth, y = BBproPred, color="Baseball Prospectus"), alpha = 0.5) + 
    geom_point(aes(x = Truth, y = BRefPred, color="Baseball Reference"), alpha = 0.5) +
    scale_color_manual(values = c("Fan Graph" = rgb(1,0,0), "Baseball Prospectus" = rgb(0,1,0), "Baseball Reference"=rgb(0,0,1))) +
    geom_abline(slope = 1, intercept = 0) +
    ggtitle("True v Predicted Wins") +
    xlab("True Wins") +
    ylab("Predicted Wins") +
    xlim(40, 125) +
    ylim(40, 125)

##Is there a difference by Era? 

#A few Eras to consider: Deadball 2: 1960-1969
#                        DH: 1973-Present
#                        Free Agency: 1975-Present
#                        Steroid Era: 1985-2005
#                        Wild Card: 1994-Present

deadball2 = data_wide %>%
    filter(yearID %in% 1960:1969) %>%
    group_by(lgID) %>%
    summarize(MSE(Truth, FanGraphPred), MSE(Truth, BBproPred),
              MSE(Truth, BRefPred) )
deadball2 = cbind(deadball2, Era="Deadball")

dh = data_wide %>%
    filter(yearID %in% 1973:2016) %>%
    group_by(lgID) %>%
    summarize(MSE(Truth, FanGraphPred), MSE(Truth, BBproPred),
              MSE(Truth, BRefPred) )
dh = cbind(dh, Era="DH")

free_agency = data_wide %>%
    filter(yearID %in% 1975:2016) %>%
    group_by(lgID) %>%
    summarize(MSE(Truth, FanGraphPred), MSE(Truth, BBproPred),
              MSE(Truth, BRefPred) )
free_agency = cbind(free_agency, Era = "Free Agency")

steroid = data_wide %>%
    filter(yearID %in% 1985:2005) %>%
    group_by(lgID) %>%
    summarize(MSE(Truth, FanGraphPred), MSE(Truth, BBproPred),
              MSE(Truth, BRefPred) )
steroid = cbind(steroid, Era="Steroid")

wildcard = data_wide %>%
    filter(yearID %in% 1994:2016) %>%
    group_by(lgID) %>%
    summarize(MSE(Truth, FanGraphPred), MSE(Truth, BBproPred),
              MSE(Truth, BRefPred) )
wildcard = cbind(wildcard, Era="Wildcard")

era_analysis = rbind(deadball2, dh, free_agency, steroid, wildcard)

From this plot, it appears that both rWAR and WARP capture the trend of true wins over time. However, there is substantial bias: both WAR statistics hugely underestimate the truth.

Next, let's build a naive model to capture the effect of WAR on predicting the truth for each of the three WAR statistics.

According to RMSE, rWAR is a better predictor than WARP. However, we can definitely improve upon the naive model by accounting for the substructure within our data, e.g., team effects or team-by-year interaction effects.
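The naive model referred to here is just a per-source linear regression of true wins on predicted wins, compared by RMSE. A minimal sketch (toy_wide and all of its values are invented stand-ins for our data_wide, so the snippet is self-contained):

```r
#Toy stand-in for our data_wide (all values invented for illustration)
toy_wide <- data.frame(
  Truth     = c(95, 88, 70, 81, 102, 64),
  BRefPred  = c(93, 90, 72, 80,  99, 67),  #rWAR-based predicted wins
  BBproPred = c(98, 85, 75, 84,  96, 70)   #WARP-based predicted wins
)

#Naive models: one simple linear regression per WAR source
naive_rWAR <- lm(Truth ~ BRefPred,  data = toy_wide)
naive_WARP <- lm(Truth ~ BBproPred, data = toy_wide)

#Compare in-sample fit by RMSE (root mean squared residual)
rmse <- function(model) sqrt(mean(residuals(model)^2))
c(rWAR = rmse(naive_rWAR), WARP = rmse(naive_WARP))
```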

data_long <- gather(data_wide, Pred, Val, -c(franchID, yearID, Truth, Games, lgID))
data_long2 <- data_long %>%
  mutate(Residuals = (as.numeric(as.character(Val)) - as.numeric(as.character(Truth))))

gg2 <- data_long2 %>%
  ggplot(aes(x = franchID, y = Residuals)) +
  geom_boxplot(fill = "lightblue") +
  facet_wrap(~ Pred, ncol = 1) +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("Team Name") +
  ggtitle("")
gg2

Here, we plot the residuals (WAR predictions minus the true number of wins) across teams for the three WAR statistics: WARP (top), rWAR (middle), and fWAR (bottom). To assess degree of fit, we visually examine these boxplots, with boxplots centered around 0 suggesting good fit.

From these residual boxplots, we can see that Baseball Prospectus has the largest residuals (longest boxes) across the different teams. Moreover, it also has many outliers, which may dilute the signal. On the other hand, Baseball Reference's and FanGraphs' WAR statistics perform relatively well, with boxplots centered around 0.

Final Analysis


library(caret)      #createFolds
library(MLmetrics)  #MSE

set.seed(123)
k <- 10
#Perform k-fold cross validation
folds <- createFolds(1:nrow(data_wide), k)

#Objective for gradient-free optimization of the combination weights;
#df is the data the weights are evaluated on
fn <- function(x, df){
  ref <- x[1]
  pros <- x[2]
  fan <- x[3]
  pred <- ref*df$BRefPred + pros*df$BBproPred + fan*df$FanGraphPred 
  MSE(pred, df$Truth)
}

#Initialize output matrices
MSE_val <- matrix(NA, 4, k)
coeffs <- matrix(NA, 3, k)

for(i in 1:k){
  test_df <- data_wide[folds[[i]],]
  train_df <- data_wide[-folds[[i]],]
  
  m1 <- lm(Truth ~ BRefPred, data = train_df)
  m2 <- lm(Truth ~ BBproPred, data = train_df)
  m3 <- lm(Truth ~ FanGraphPred, data = train_df)
  #Fit the combination weights on the training fold only
  m4 <- optim(c(1, 1, 1), fn, df = train_df, method = "Nelder-Mead")
  
  ## test prediction on test dataset
  pred1 <- predict(m1, newdata = test_df)
  pred2 <- predict(m2, newdata = test_df)
  pred3 <- predict(m3, newdata = test_df)
  
  test_df_lm <- cbind(test_df,
                      BRef_lmpred = pred1,
                      BBpro_lmpred = pred2, 
                      FanGraph_lmpred = pred3)
  
  MSE_val[1, i] <- MSE(test_df_lm$Truth, test_df_lm$BRef_lmpred)
  MSE_val[2, i] <- MSE(test_df_lm$Truth, test_df_lm$BBpro_lmpred)
  MSE_val[3, i] <- MSE(test_df_lm$Truth, test_df_lm$FanGraph_lmpred)
  #Evaluate the combined model on the held-out fold
  MSE_val[4, i] <- fn(m4$par, test_df)
  coeffs[, i] <- m4$par
}

rownames(MSE_val) <- c("Baseball Reference", "Baseball Prospectus", "Fangraph", "Final Model")
apply(MSE_val, 1, mean)
##  Baseball Reference Baseball Prospectus            Fangraph 
##            5.130914            7.658060            5.694803 
##         Final Model 
##            4.685790
#Obtain coefficients from combined model
final_coef <- apply(coeffs, 1, mean)

#Make predictions based on combined model
data_wide <- data_wide %>%
  mutate(CombinedPred = final_coef[1]*BRefPred + final_coef[2]*BBproPred + final_coef[3]*FanGraphPred)


Final_diff = data_wide$CombinedPred - data_wide$Truth
diff3 = cbind(FG_diff, BBP_diff, BRef_diff, Final_diff)
diff3 = as.data.frame(diff3)

diff3 %>% ggplot() + 
    geom_histogram(aes(x=FG_diff, fill="Fan Graph"), alpha = 0.2, breaks = seq(-20, 20, 0.5)) +
    geom_histogram(aes(x=BBP_diff, fill="Baseball Prospectus"), alpha = 0.2, breaks = seq(-20, 20, 0.5)) +
    geom_histogram(aes(x=BRef_diff, fill="Baseball Reference"), alpha = 0.2, breaks = seq(-20, 20, 0.5)) +
  geom_histogram(aes(x=Final_diff, fill="Final Model"), alpha = 0.2, breaks = seq(-20, 20, 0.5)) +
    ggtitle("Difference Between Predicted and True Wins") +
    xlab("Difference") +
    ylab("Count") +
    scale_fill_manual(values = c("Fan Graph" = rgb(1,0,0), "Baseball Prospectus" = rgb(0,1,0), "Baseball Reference"=rgb(0,0,1), "Final Model" = rgb(0, 0, 0)))

data_wide %>% ggplot() +
    geom_point(aes(x = Truth, y = FanGraphPred, color="Fan Graph"), alpha = 0.5) + 
    geom_point(aes(x = Truth, y = BBproPred, color="Baseball Prospectus"), alpha = 0.5) + 
    geom_point(aes(x = Truth, y = BRefPred, color="Baseball Reference"), alpha = 0.5) +
    geom_point(aes(x = Truth, y = CombinedPred, color = "Final Model"), alpha = 0.5) +
    scale_color_manual(values = c("Fan Graph" = rgb(1, 0, 0), "Baseball Prospectus" = rgb(0, 1, 0), "Baseball Reference"=rgb(0, 0, 1), "Final Model" = rgb(0, 0, 0))) +
    geom_abline(slope = 1, intercept = 0) +
    ggtitle("True v Predicted Wins") +
    xlab("True Wins") +
    ylab("Predicted Wins") +
    xlim(40, 125) +
    ylim(40, 125)